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Abstract 

The cross-entropy (CE) method is simple and versatile technique for optimization, 
based on Kullback-Leibler (or cross-entropy) minimization. The method can be applied to 
a wide range of optimization tasks, including continuous, discrete, mixed and constrained 
optimization problems. The new package CEoptim provides the R implementation of the 
CE method for optimization. We describe the general CE methodology for optimization 
and well as some useful modifications. The usage and efficacy of CEoptim is demonstrated 
through a variety of optimization examples, including model fitting, combinatorial opti¬ 
mization, and maximum likelihood estimation. 

Keywords: Constrained optimization, continuous optimization, cross-entropy, discrete opti¬ 
mization, Kullback-Leibler divergence, lasso, maximum likelihood, R, regression. 


1. Introduction 

The cross-entropy (CE) method originates from an adaptive variance minimization algorithm 
in Rubinstein (1997) for the estimation rare event probabilities in stochastic networks. It 
was realized in Rubinstein (1999) that many optimization problems could be converted into a 
rare-event estimation problems, providing a rare-event based approach to optimization, where 
a sequence of probability densities is generated that converges to a degenerate density that 
concentrates its mass close to the optimizer. 

Generally, the CE method involves two iterative phases: 

1. Generation of a set of random samples (vectors, trajectories, etc.) according to a spec¬ 
ified parameterized model. 

2. Updating of the model parameters, based on the best samples generated in the previous 
step. This is done by Kullback-Leibler (also called cross-entropy) minimization. 

Since the appearance of the CE monograph (Rubinstein and Kroese 2004) and the tutorial 
(De Boer et al. 2005), the CE method has continued to develop and has been successfully 
applied to a great variety of difficult optimization problems, including motion planning in 
robotic systems (Kobilarov 2012), electricity network generation, (Kothari and Kroese 2009), 
control of infectious diseases (Sani and Kroese 2008), buffer allocation (Alon et al. 2005), 
Laguerre tessellation (Duan et al. 2014), and network reliability (Kroese et al. 2007). An ex¬ 
tensive list of recent work can be found in (Botev et al. 2013). Websites that provide Matlab 
code include www.cemethod.org and www.montecarlohandbook.org. Since R has become an 
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essential tool for statistical computation, it is useful to provide an accessible implementation 
of the CE method for R users, similar to R packages for simulated annealing (Xiang et al. 
2013), evolutionary methods (Mullen et al. 2011), and particle swarm optimization methods 
(Bendtsen 2012). 

Some advantages of the CE method are: 

• The CE method is a global optimization method which is particularly useful when the 
objective function has many local optima. 

• The CE method can be used to solve continuous, discrete, and mixed optimization 
problems, which may also include constraints. 

• The CE code is extremely compact and is readily written in native R, making further 
development and modifications easy to implement. 

• The CE method is based on rigorous mathematical and statistical principles. 

Our aim is not to replace the standard optimization solvers such as optim and nlm but to 
provide a viable alternative in cases where standard gradient or simplex-based solvers are 
not applicable (e.g., when the optimization problem contains both discrete and continuous 
variables) or are expected to do poorly (e.g., when there are many local optima). 

The rest of this paper is organized as follows. In Section 2, we sketch the general theory behind 
the CE method, which leads to the basic CE algorithm. In Section 3, we describe a variety 
of optimization scenarios, including continuous, discrete and constrained mixed problems, to 
which CE can be applied effectively. The description and usage of the CEoptim package are 
given in Section 4. Section 5 demonstrates the capability of the package through a range of 
numerical examples. In the final section we make concluding remarks for CEoptim. 


2. CE method for optimization 

Let be an arbitrary set of states and let S' be a real-valued performance function on £%£. 
Suppose the goal is to find the minimum of S over , and the corresponding minimizer x* 
(assuming, for simplicity, that there is only one). Denote the minimum by 7 *, so that 

S(x*) = 7* = min S(x). (1) 


The CE methodology for optimization is adapted from the CE methodology for rare event 
estimation in the following way. Associate with the above problem ( 1 ) the estimation of 
the probability £ = P(S(X) ^ 7), where X has some probability density /(x;u) on (for 
example corresponding to the uniform distribution on £%f) depending on a parameter u and 
a level 7. Thus, for optimization problems randomness is purposely introduced in order to 
make the model stochastic. If 7 is chosen close to the unknown 7*, then £ is typically a 
rare-event probability. One of the most effective ways to estimate rare-event probabilities is 
to use importance sampling. In particular, to estimate £ = P(5(X) ^ 7) one can use the 
importance sampling estimator 


£ = 


N 


N 

E 


/(X. t 


Tt 5(Xi 


■I{5(Xi) ^ 7 }, 
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where Xi,..., Xjv are iid samples from a well-chosen importance sampling density g. The 
optimal importance sampling density is in this case <?*(x) = /(x)I{5'(x) ^ 7 }/^, which gives 
a zero-variance estimator, but depends on the unknown quantity i. The main idea behind the 
CE method for estimation is to adaptively determine an importance sampling pdf /(x; v*) — 
hence within the same family as the original distribution — that is close to g* in Kullback- 
Leibler sense. Specifically, a parameter v* is sought that minimizes the cross-entropy distance 


W,/(-;v)) = 


In 


g*(X) ~ 

/(X;v)_ 


= / 9* ( x ) In g* (x) dx 


g*(x) ln/(x; v) dx . 


This is equivalent to maximizing, with respect to v, 


j /(x; u)I{5(x) ^ 7 } ln/(x; v) dx = E u [I{5(X) ^ 7 } ln/(X; v)] , 


which in turn can be estimated by maximizing the sample average 

1 N 

-^[I{S(X^ 7 }ln/(X i; v)] , (2) 

i=l 

where Xi,...,Xjv is an iid sample from /(x;u). This is, in essence, maximum likelihood 
estimation. In particular, (2) gives the maximum likelihood estimator of v based on only 
the samples Xi,..., Xjy that have a function value less than or equal to 7 . These are the 
so-called elite samples. 

The relevance to optimization is that when 7 is close to the (usually unknown) minimum 
7 *, then the importance sampling density g* concentrates most of its mass in the vicinity of 
the minimizer x*. Sampling from such a distribution thus produces optimal or near-optimal 
states. The CE method for optimization produces a sequence of levels ( 7 *) and reference 
parameters (v*) determined from ( 2 ) such that the former tends to the optimal 7 * and the 
latter to the optimal reference vector v*, where /(x; v*) corresponds to the point mass at x*; 
see, e.g., (Rubinstein and Kroese 2008, Page 251). 

The generic steps for CE optimization are specified in Algorithm 1. 


Algorithm 1 Generic CE algorithm 

Input: Initial parameter vector vo- Sample size N. Rarity parameter g. 

Output: Sequence of levels ( 7 t)f=i and parameters (vt)T=i- 
1 : Let N e = \gN~\ (number of elite samples) and set t = 1 (level counter). 

2 : while the sampling distribution is not degenerate do 

3: Generate Xi,..., Xjv ~iid /(■; vt_i). Calculate the performances S'(Xj) for all i, and 

order them from smallest to largest: ^ ... ^ SAp. Let 7 * be the sample ^-quantile 

of performances; that is, 7 1 = S( N e\. 

4: Use the same sample Xi,..., Xjv and solve the stochastic program 

N 

max V'l{5(X fc ) ^ 7 *} ln/(X fc ;v) . (3) 

V Z ' 

k= 1 

Denote the solution by Vf . Increase t by 1. 

5: end while 
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To run the algorithm, one needs to provide the class of sampling densities {/(•; v)}, the initial 
vector vo, the sample size N, the rarity parameter q, and the stopping criterion. It is prudent 
to keep track of the overall best function value and corresponding state, and report these at 
the end of the algorithm as the optimal value and optimizer, respectively. The progression of 
level parameter 7 1 gives an indication how well the algorithm converges. 

As (3) is simply a maximum likelihood estimation step involving only the elite samples, it is 
possible to derive easy parameter updates for standard sampling distributions. The following 
two special cases are of particular importance. 

1. Multivariate normal distribution. Suppose each X is sampled from an re-dimensional 
multivariate normal distribution with independent components. The parameter vector 
v in the CE algorithm can be taken as the 2n-dimensional vector of means and stan¬ 
dard deviations. In each iteration these means and standard deviations are updated 
according to the sample mean and sample standard deviation of the elite samples. 

2. Multivariate Bernoulli distribution. Suppose each X is sampled from an n-dimen¬ 
sional Bernoulli distribution with independent components. The parameter vector v in 
the CE algorithm can be taken as the n-dimensional vector of success probabilities. In 
each iteration the zth success probability is updated according to the mean number of 
successes (Is) at the zth position of the elite samples. 

Remark 1 (Parameter Smoothing) Various modifications of the basic CE algorithm have 
been proposed in recent years. One such is modification is parameter smoothing , where at the 
ftli iteration the sampling parameter is updated via 

v t = av t + (1 - a) vt— 1 , (4) 

where is the solution to (3) and 0 ^ a ^ 1 is a fixed smoothing parameter. 

Smoothed updating can prevent the sampling distribution from converging too quickly to a 
sub-optimal degenerate distribution. This is especially relevant for the multivariate Bernoulli 
case where, once a success probability reaches 0 or 1 , it can no longer change. 

It is also possible to use different smoothing parameters for different components of the 
parameter vector (e.g., the means and the variances). 

Remark 2 (Choice of sampling densities) Although sampling distributions with inde¬ 
pendent components are the most convenient to use in a CE implementation, it is sometimes 
advantageous consider more complex sampling models, such as mixture models. In this case 
the updating of parameters (maximum likelihood estimation) may no longer be trivial, but 
one can instead employ fast methods such as the EM algorithm to determine the parameter 
updates. 

Remark 3 (Choice of the CE parameters) The CE method is fairly robust with respect 
to the choice of the parameters. The rarity parameter g is typically chosen between 0.01 and 
0.1. The number of elite samples N e = [giV] should be large enough to obtain a reliable 
parameter update in (3). For example, if the dimension of v is d , the number of elites should 
be in the order of 10 d or higher. 
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3. Optimization scenarios 

In this section we consider a number optimization scenarios to which CEoptim could be 
applied. 

3.1. Continuous optimization 

Consider a continuous optimization problem with state space = M n . The sampling distri¬ 
bution on M n can be quite arbitrary and does not need to be related to the objective function 
S. Usually, the random vector X = (Ah,..., X n ) T e M n is generated from a Gaussian distri¬ 
bution with independent components, characterized by a vector fi of means and a vector cr 
of standard deviations. At each iteration of the CE method, these vectors of parameters are 
updated as the means and standard deviation of the elite samples. During the course of the 
algorithm a sequence of (/x t ) and (cr*) are generated, such that tends to the optimizer x*, 
while the vector of standard deviations tends to the zero vector. At the end of the algorithm 
one should obtain a degenerated probability density with mean p T approximately equal to 
the optimizer x* and all standard deviations close to 0. A possible stopping criterion is to 
stop when all components in or are smaller than some e. This scheme is referred to as normal 
updating. 

CEoptim implements the normal updating scheme for continuous optimization. 


3.2. Discrete optimization 


If the state space & is finite, the optimization problem is often referred to as a discrete or 
combinatorial optimization problem, where could be the space of combinatorial objects, 
such as binary vectors, trees, graphs, etc. To apply the CE method to a discrete optimization 
problem, one needs a convenient parameterized random mechanism to generate samples. 

For discrete optimization CEoptim implements sampling from state spaces & of the form 
{0,1,..., ci — 1} x • • • x {0,1,..., c n — 1}, where the {c*} are strictly positive integers. The 
components of the random vector X = (AT,..., X n ) G are taken to be independent, so 
that its distribution is determined by a sequence of probability vectors pi,..., p n , with the 
jth component of p, corresponding to pij = P(A* = j). For a given elite sample set & of size 
N e , the CE updating formulas for these probabilities are 


Pij 


Exe# — j} 
N e 


i = 1,... ,n, j = 0,... ,c n — 1, 


( 5 ) 


where I denotes the indicator function. Hence, at each iteration, probability ptj is updated 
simply as the average number of times that the ith component of the elite vectors is equal to 
j. A possible stopping rule for a discrete optimization problem is to stop when the overall 
best objective value does not change over a number of iterations. Alternatively, one could 
stop when the sampling distribution has degenerated sufficiently; for example, when all {pij} 
are no further than e away from either 0 or 1. 


3.3. Constrained optimization 

The general optimization problem (1) also covers constrained optimization, where the search 
space could, for example, be defined by a system of inequalities: 

G,(x) ^ 0, i = 1 ,..., k. 


( 6 ) 
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One way to deal with constraints is to use acceptance-rejection: generate a random vector 
X on a simple search space that contains SC, and accept or reject it based on whether the 
sample falls in SC or not. Alternatively, one could try to sample directly from a truncated 
distribution on SC , e.g., using Gibbs sampling. 

CEoptim implements linear constraints for continuous optimization of the form Ax ^ b, 
where A is a matrix and b a vector. The program will use either acceptance-rejection or Gibbs 
sampling to sample from the multivariate normal distribution truncated to the constraint set. 

A second approach to handle constraints is to introduce a penalty function. For example, for 
the constraints ( 6 ), the objective function could be modified to 

k 

S(x) = S(x) + H i max{Gj(x), 0}, (7) 

2—1 

where Hi < 0 measures the importance of the ith penalty. To use the penalty approach with 
CEoptim the user simply needs to modify the objective function according to (7). The choice 
of the penalty constants {Hi} is problem specific and may need to be determined by trial and 
error. 


4. CEoptim description 

In this section we describe how to use CEoptim. 

The CEoptim function is the main function of the package CEoptim. It can be used to solve 
continuous and discrete optimization problems as well as mixtures thereof. 
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4.1. Usage 

CEoptim(f, f.arg=NULL, maximize=FALSE, continuous=NULL, discrete=NULL, 
N=100L, rho=0.1, iterThr=le4L, no!mproveThr= 5, verbose=FALSE) 


4.2. Arguments 
Argument Description 


f 

f .arg 
maximize 

continuous 

mean 

sd 

smoothMean 

smoothSd 

sdThr 

conMat 

conVec 

discrete 

categories 


probs 

smoothProb 

probThr 


N 

rho 

iterThr 


Function to be optimized. Can have continuous and discrete arguments. 

List of additional fixed arguments passed to function f. 

Logical value determining whether to maximize or minimize the objective 
function. 

List of arguments for the continuous optimization part, consisting of: 
Vector of initial means. 

Vector of initial standard deviations. 

Smoothing parameter for the vector of means. Default value 1 (no 
smoothing). 

Smoothing parameter for the standard deviations. Default value 1 (no 
smoothing). 

Positive numeric convergence threshold. Check whether the maximum 
standard deviation is smaller than sdThr. Default value 0.001. 

Coefficient matrix of linear constraint conMat x ^ conVec. 

Value vector of linear constraint linear constraint conMat x ^ conVec. 

List of arguments for the discrete optimization part, consisting of: 

Integer vector which defines the allowed values of the categorical 
variables. The ith categorical variable takes values in the set 
{0,1,.. •,categories(i) — 1}. 

List of initial probabilities for the categorical variables. Defaults to equal 
(uniform) probabilities. 

Smoothing parameter for the probabilities of the categorical sampling dis¬ 
tribution. Default value 1 (no smoothing). 

Positive numeric convergence threshold. Check whether all probabilities 
in the categorical sampling distributions deviate less than probThr from 
either 0 or 1. Default value 0.001. 

Integer representing the CE sample size. 

Value between 0 and 1 representing the elite proportion. 

Termination threshold on the largest number of iterations. 
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noImproveThr Termination threshold on the largest number of iterations during which no 

improvement of the best function value is found. 

verbose Logical value set for CE progress output. 

4.3. Value 

CEoptim returns a list with the following components, 
optimum Optimal value of f. 

optimizer List of the location of optimal value, consisting of: 

continuous Continuous part of the optimizer, 
discrete Discrete part of the optimizer. 

termination List of termination information consisting of: 

niter Total number of iterations upon termination, 

convergence One of the following termination statements: 

• Not converged, if the number of iterations reaches iterThr; 

• The optimum did not change for noImproveThr iterations, if 
the best value has not improved for noImproveThr iterations; 

• Variances converged, otherwise. 

states List of intermediate results computed at each iteration. It consists of the 

iteration number (iter), the best overall value (optimum) and the worst 
value of the elite samples, (gammat). The means (mean) and maximum 
standard deviation (maxSd) of the elite set are also included for continuous 
cases, and the maximum deviations (maxProbs) of the sampling probabili¬ 
ties to either 0 or 1 are included for discrete cases. 

states .probs List of categorical sampling probabilities computed at each iteration. Will 

only be returned for discrete and mixed cases. 

4.4. Note 

• Although partial parameter passing is allowed outside lists, it is recommended that pa¬ 
rameters names are specified in full. Parameters inside lists have to specified completely. 

• Because CEoptim is a random function it is useful to (1) set the seed for the random 
number generator (for testing purposes), and (2) investigate the quality of the results 
by repeating the optimization a number of times. 
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5. Numerical examples 

The following examples illustrate the use, flexibility, and efficacy of the CEoptim function 
from the package CEoptim. 

5.1. Maximizing the peaks function 

Suppose we wish to maximize Matlab’s well-known peaks function, given by 

S(x) = 3(1 - Xl f e -*?-(*a+ 1 ) a - 10 - x\ - xf) e~ x i~ x 2 - i e -(*i+i) 2 -*! . ( 8 ) 

\ o / o 



Figure 1: Peaks function 


The peaks function has three local maxima and three local minima, with a global maximum 
at x* « (—0.0093,1.58) of S(x.*) ps 8.1, and the other two local maximum are S'(xi) = 3.78 
at (-0.46, -0.63) and S(x 2 ) = 3.59 at (1.29, -0.0049). 

To solve the problem with CEoptim, using normal updating, we must specify the vector of 
initial means p 0 and standard deviations cr o of the 2-dimensional Gaussian sampling distribu¬ 
tion. The initial sampling distribution should cover, roughly, the region where the maximizer 
is thought to lie. As an example we take p 0 = (—3, —3) and er 0 = (10,10). The important 
point is that the standard deviations are chosen large enough. Since this is a maximization 
problem, we have to set maximize=T. For the other parameters we take their default values. 
Note that there are only four parameters to be updated in each iteration, so a sample size of 
N = 100 is suitable. 

R> require(CEoptim) 

R> fun <- function(x){3*(l-x[l] ) ~2*exp(-x [1] ~2 - (x[2]+1) ~2)-10*(x[1]/5 + 

-x[l]~3 - x[2] ~5)*exp(-x[l] ~2 - x[2]~2) + 

-l/3*exp(-(x[1]+1)~2 - x[2]~2)} 
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R> set.seed(1234) # for verification purpose only 

R> muO <- c(-3,-3); sigmaO <- c(10,10) 

R> res <- CEoptim(fun, maximize=T, continuous=list(mean=muO,sd=sigmaO)) 

R> res 

The output of this implementation is as below: 

Optimizer for continuous part: 

-0.009390034 1.581405 
Optimum: 

8.106214 

Number of iterations: 

7 

Convergence: 

Variance converged 

The reader may check that optim applied to the minimization of —/ can easily find the wrong 
optimizer, e.g., when the starting value is ( 0 , 0 ). 


5.2. Non-linear regression 


We next consider a more complicated optimization task, involving data generated from the 
well-known FitzHugh-Nagumo differential equations: 


dVf 
d t 
d Rt 
dt 


c { v ‘~Y + r ‘) ■ 

— (Vt — a + bRt) , 
c 


(9) 


which model the behavior of certain types of neurons (Nagumo et al. 1962). Ramsay et al. 
(2007) consider estimating the parameters a, b, and c from noisy observations of (Vt) by using a 
generalized smoothing approach. The simulated data in Figure 2 (saved as data(FitzHugh)) 
correspond to the values of V t obtained from (9) at times 0,0.05,..., 20.0, adding Gaussian 
noise with standard deviation 0.5. That is, we use the non-linear regression model 


Yi = Vb .05 i(x) + £i, i = 1,..., 400 , (10) 

where the {et} are iid with a N(0,<r 2 ) distribution, Vb. 05 *( x ) is the solution to (9) for time 
t = 0.05?', and x = (a, 6, c, Vo, Rq) is the vector of parameters. The true parameter values are 
here a = 0.2, b = 0.2, and c = 3. The initial conditions are Vo = — 1 and Rq = 1. 

Estimation of the parameters via the CE method can be established by minimizing the least- 
squares performance 

400 

S'(x) = (yi - ^0.05i(x)) 2 , (11) 

i=0 

where the {?/*} are the simulated data from the model (10). Note that we assume that also 
the initial conditions are unknown. 

We use the deSolve package to numerically solve the FitzHugh-Nagumo differential equations 
(9). Hereto, we first define the function FN. 
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Figure 2 : Simulated data (points) and “unknown” true curve (red). 


R> FN <- function(t,state parameters){ 

with (as. list (c (stateparameters) ) , { 
dV <- c*(V-V~3/3+R) 
dR <- -1/c*(V-a+b*R) 
list(c(dV, dR)) 

})} 

The following function ssres now implements the objective function in (11). 

R> ssres <- function (x,fundf,times,y) { 

parameters <- c(a=x[1],b=x[2],c=x[3]) 
state <- c(V=x[4],R=x[5]) 

out <- ode(y=state,times=times,func=fundf,parms=parameters) 
return(sum((out[,2]-y)~2))} 

CEoptim could be used with fj, 0 = (0,0, 5, 0,0) and er 0 = (1,1,1,1,1). Constant smoothing 
parameters a = 0.9 and /3 = 0.5 were used for the {fr t } and the respectively. To see 

the progress of the algorithm we set verbose to TRUE. The other arguments remain default. 

R> require(deSolve) 

R> require(CEoptim) 

R> set.seed(123405) 

R> times <- seq(0,20,by=0.05) 

R> data(FitzHugh) 

R> res<- CEoptim(ssres, f.par = list(fundf=FN, times=times, y=ySim), 
continuous= list(mean=c(0,0,5,0,0), sd=c(l,1,1,1,1), 
smoothMean=0.9,smoothSd=0.5 ), verbose=TRUE) 


The final output is as follows: 
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R> res 

Optimizer for continuous part: 

0.1959748 0.2395983 3.001453 -0.9938222 0.9791585 
Optimum: 

102.8005 

Number of iterations: 

41 

Convergence: 

Variance converged 

The output shows the estimates (notice that the initial condition was assumed to be unknown): 
ci = 0.1959748, b = 0.2395983, c = 3.0014526, V 0 = -0.9938222, and = 0.9791585, with the 
maximum likelihood estimate a = \/l02.8005/400 = 0.507 for the residual standard deviation 
a. The reader may check that fitted curve is practically indistinguishable from the true one 
in Figure 2. 

To illustrate how the sampling distributions change during the CE process, we have plotted 
in Figure 3 the evolution of the sampling pdf for the first parameter a, from the 15th to the 
final iteration. As can be seen from the figure, the sampling distribution converges to a point 
distribution around the optimal value for a. 



-0.4-0.2 0.0 0.2 0.4 0.6 0.8 

First parameter 


Figure 3: The evolution of the sampling pdf for the first parameter a 
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5.3. Max-cut problem 

The max-cut problem in graph theory can be formulated as follows. Given a weighted graph 
(V, E ) with node set V = {1,... ,n} and edge set E , partition the nodes of the graph into 
two subsets V] and V 2 such that the sum of the (nonnegative) weights of the edges going from 
one subset to the other is maximized. Let C = (CV, ) be the matrix of weights. The objective 
is to maximize 

E Vij + Cji) (12) 

(i,j)eVixV 2 

over all cuts [Vi, V 2 }. Such a cut can be conveniently represented by a binary cut vector 
x = (xi,X 2 , • • •, x n ), where Xi = 1 indicates that i £ V\. Let 3£ be the set of cut vectors and 
let S(x) be the value of the cut represented by x, as given in (12). 

To maximize S via the CE method one can generate the random cut vectors by drawing each 
component (except the first one, which is set to 1) independently from a Bernoulli distribution, 
that is, X = (X±,X 2 , . . ■ ,X n ) ~ Ber(p), where p = (1,P2, ■ ■ ■ ,Pn )■ In this case the updated 
success probability for the ith component is the mean of the z-tli components of the vectors 
in the elite set. 

As an example, consider the network from Knuth (1993) describing the coappearances of 77 
characters from Victor Hugo’s novel Les Miserables. Each node of the network represents a 
selected character and edges connect any pair of characters that coappear. The weights of 
the edges are the number of such coappearances. Using CEoptim, the data can be loaded via 
the command data(lesmis). The network is displayed in Figure 4, using the graph analysis 
package sna. 

R> library(sna) 

R> library(CEoptim) 

R> data(lesmis) 

R> gplot(lesmis,gmode="graph") 

For any fixed cost matrix costs and cut vector x, the objective function of the max-cut 
problem can be written as: 

R> fmaxcut <- function(x,costs){ 
vl <- which(x==l) 
v2 <- which(x==0) 
return( sum(costs[vl,v2])) } 

To optimize this function with the CEoptim package, we specify the following arguments: 
discrete$probs={(0,1); (0.5.0.5) ; . . . ; (0.5,0.5)}, sample size N=3000 and optimiza¬ 
tion type: maximize=T. To see the output we set verbose=TRUE. The other arguments are 
taken as default. Note that users only need to specify either categories or probs, if both of 
them are specified, then categories will be overridden. 

R> set.seed(5) 

R> p0<-list() 

R> for(i in 1:77){p0<-c(p0 ,2ist (rep(0.5,2) ) )} 

R> pOLLl]] = c (0,1) 
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Figure 4: Network of coappearances 


R> res <- CEoptim(fmaxcut,f.arg=list(costs=lesmis),maximize=T, 
verbose=TRUE,discrete=list(probs=pO),N=3000L) 

R> ind <- res$optimizer$discrete 

R> groupl <- colnames(lesmis)[which(ind==TRUE)] 

R> group2 <- colnames(lesmis)[which(ind==FALSE)] 

The output of CEoptim is as follows: 


R> res 

Optimizer for discrete part: 


1 

0 

1 

0 

0 

0 

0 

0 

0 

0 

1 

0 

0 

1 

1 

1 

0 

0 

1 

1 

0 

1 

0 

1 

0 

1 

1 

1 

1 

0 

0 

1 

1 

1 

1 

1 

1 

0 

0 

0 

0 

1 

0 

1 

0 

0 

1 

0 

1 

1 

1 

0 

1 

1 

1 

0 

0 

0 

0 

1 

1 

0 

1 

0 

0 

1 

1 

1 

1 

0 

0 

1 

0 

1 

0 

0 

0 

















Optimum: 

535 

Number of iterations: 

20 

Convergence: 

Optimum did not change for 5 iterations 

Note that character 1 (Myriel) is always in groupl. The initial probabilities for the other 
characters are 0.5. With states .probs, we can plot the evolution of the probabilities that 
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each character belongs to groupl; see Figure 5. 

R> probs <- res$states.probs 

R> X <- matrix(NA,nrow=length(probs),ncol=77) 

R> probO <- cbind(l,t(rep(0.5,76))) 

R> ford in 1: length (probs)){ 
for(j in 1:77){ 

X[i,j] <_ res$states.probs[[i]][[j]][2] 

} 

} 

R> X <- rbind(probO,X) 

R> par(mfcol=c(5,2),mar=c(1,1.5,1,1.5),oma=c(l,1,1,1)) 

R> ford in 1:5){ 

plot(X[i,],type="h",lwd=4,col="blue",ylim=c(0,1),xaxt="n",yaxt="n",ylab="", 
main=paste("t=",i-1,sep="")) 
axis(2,at=0.5,labels=0.5)} 

R> ford in 1:5){ 

plot(X[l+4*i,J,type="h",lwd=4,col="blue",ylim=c(0,1),xaxt="n”,yaxt="n",ylab="", 
main=paste("t=",l+4*i,sep="")) 
axis(2,at=0.5,labels=0.5)} 


i=a 


in 

o 





t=9 





t=13 





LO 


lUiiMMlI 




Figure 5: Evolution of categorical sampling probabilities that characters in group 1. 
Based on the output above, the two groups of characters are indicated in Table 1: 
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groupl 

Myriel, MlleBaptistine, Labarre, 
MmeDeR, Isabeau, Gervais, Fameuil, 
Blacheville , Dahlia, Fantine, 
Thenardier, Cosette, Javert, 
Fauchelevent, Simplice, Scaufflaire, 
Oldwomanl, Judge, Champmathieu, 
Brevet, Eponine, 01dwoman2, 

Jondrette, Gavroche, Gillenormand, 
Magnon, MmePontmercy, MlleVaubois, 

LtGillenormand, Combeferre, 

Prouvaire, Courfeyrac, Joly, 
Grantaire, MotherPlutarch, Gueulemer, 
Montparnasse, Childl 


group2 

Napoleon, MmeMagloire, CountessDeLo, 
Geborand, Champtercier, Cravatte, 
Count, OldMan, Valjean, Marguerite, 
Tholomyes, Listolier, Favourite, 
Zephine, MmeThenardier, Bamatabois, 
Perpetue, Chenildieu, Cochepaille, 
Pontmercy, Boulatruelle, Anzelma, 
MotherInnocent, Gribier, MmeBurgon, 
MlleGillenormand, Marius, BaronessT, 
Mabeuf, Enjolras, Feuilly, Bahorel, 
Bossuet, Babet, Claquesous, Toussaint, 
Child2, Brujon, MmeHucheloup 


Table 1: Two groups of characters providing a maximal cut. 


We have run the program for 1000 times randomly. In 312 cases the optimal solution (535) 
was found. The frequency of the results of CEoptim is given in Figure 6. 


312 



528 529 530 531 532 533 534 535 


Figure 6: Frequency of best max-cut values found by CEoptim. 
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5.4. Constrained minimization of the griewank function 

To illustrate constrained optimization with CEoptim, we consider the minimization of the 
griewank function, which is widely used to test the convergence of optimization algorithms. 
The griewank function of order n is defined as 


S(x) 


1 + d 00 

i= 1 i=1 


n 


cos 


Xi 


(13) 


where x = (xi,..., x n ) T takes values in some subset of M n . The function has many local 
minima with (in the unconstrained case) a global minimum at x* = (0,..., 0) of S(x*) = 0. 

We wish to minimize the griewank function of order 2 over the triangle with vertex points 
(1,4), (4,0), and (8,4); see Figure 7. 



-2 0 2 4 6 8 10 


Figure 7: Contour plot of the griewank function and the triangular constraint region. The 
optimal solution (indicated by a cross) lies on the boundary of the constraint region. 


The constraint set can be written as the linearly constrained region {x E M 2 : Ax ^ b} with 



To solve the problem with CEoptim we proceed as follows: 


R> require(CEoptim) 

R> set.seed(123) 

R> griewank <- function(X) { 
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p <- length(X) 
r <- c() 

for (i in 1:p) { 

r[i ] <- cos (X [i] /sqrt (i) ) 

} 

return(1+sum(X~2)/4000-prod(r)) 

} 

R> A <- rbind(c(0,l),c(-1,-1),c(1,-1)) 

R> b <- c(4,-4,4) 

R> res <- CEoptim(griewank,continuous=list(mean=c(0,0), sd=c(10,10), conMat=A, 
conVec=b), rho=0.1, N=200L, verbose=TRUE, noImproveThr=Inf) 

R> cat("direct optimizer =", res$optimizer$continuous,"\n") 

R> cat("direct minimum =",res$optimum,"\n") 

The corresponding output shows that the minimum is obtained at the boundary of the trian¬ 
gle. 

R> direct minimizer = 3.139669 3.991955 
R> direct minimum = 0.05685487 

It is also possible to use a penalty approach for this problem. Here we take the penalty 
function 

S(x) = S(x) + 100 ||Hx — b||, 

which can be implemented in the following way. 

R> griewank.penalty <- function(X,A,b) { 
fn <- griewank(X) 
if (any(A/*Z as.vector(X) > b)){ 

penalty <- norm(A/*Z as.vector(X)- b) 
fn <- fn + 100*penalty} 
return (fn) 

} 

The optimization now proceeds as follows (note that we have also changed rho and N): 

R> set.seed(123) 

R> res,pen<- CEoptim (griewank.penalty,f.arg=list(A,b),continuous=list(mean=c(0, 
sd=c(10,10)),rho=0.01,N=2000L,verbose=TRUE,no!mproveThr=Inf) 

R> cat("penalty minimizer =",res,pen$optimizer$continuous,"\n") 

R> cat("penalty minimum =",griewank(res,pen$optimizer$continuous),"\n") 

This leads to practically the same result: 

R> penalty minimizer = 3.139757 4 
R> penalty minimum = 0.055103 
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5.5. Dirichlet data 

Suppose that we are given a random sample of data from a Dirichlet(a:) distribution, where 
a = («i,..., qlk+ i) T is an unknown parameter vector satisfying ccj > 0, i = 1,..., K + 1. 
Recall that the pdf of a random vector Y = (Y \,..., Yk) ~ Dirichlet(a) is given by 


£( \ J- a i) IT c 

{ n S TK,n, 


&K+ 1-1 


for x t ^ 0, i = 1,..., K and i Vi ^ 1 > where F is the gamma function. The conditions on 
a provide natural inequality constraints: Gi(a ) = —ccj ^ 0, i = 1,..., K + 1. 

We will use CE method to obtain the maximum likelihood estimate by direct maximization 
of the log-likelihood for the Dirichlet distribution given the data. 

For a particular example, a data size of n = 100 points are sampled from the Dirichlet(l, 2, 3,4, 5) 
distribution with the assistance of the function rdirichlet in the CEoptim package. 


R> require(CEoptim) 

R> set.seed(12345) 

R> a <- 1:5 

R> K <- length (a)-1 

R> n <- 100 

R> y <- dirichletrnd(a,n) 


To use CEoptim to solve the MLE problem. The objective function is written as follows: 

R> dirichletLoglike <- function(alpha,Y,n,K){ 

t <- apply(Y,MARGIN=1,function(y){sum((alpha[1:K]-l)*log(y[l:K]))+ 

(alpha [K+lJ -l)*log(1-sum (y[l :KJ))}) 
out <- n*(log(gamma(sum(alpha)))-sum(log(gamma(alpha))))+sum(t) 
return(out)} 

The CE parameters are initial mean vector /x = (0,0,0,0,0) and standard deviation vector 
o' = (10,10,10,10,10). The sample size of N = 10 1 and the elite ratio is default. To pass the 
linear constraints that cc* > 0, i = 1,..., K + 1, the coefficient matrix is 

/-I 0 0 0 0 \ 

0-10 0 0 

A= 0 0 -1 0 0 , 

0 0 0 -1 0 

\ 0 0 0 0 -1/ 

and the constraint vector is b = (0,0,0,0,0). No smoothing parameter is applied to the 
mean vector, but a constant smoothing parameter of smoothSd=0.5 is applied to each of the 
standard deviations. This is a maximization problem, so set maximize=T. 

R> muO <- rep(0,times=K+l) 

R> sigmaO <- rep(10,times=K+l) 
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R> A <- matrix(rep(0,times=25),nrow=5) 

R> diag(A)<- rep(-l,times=5) 

R> b <- rep(0,times=5) 

R> res <- CEoptim(dirichletLoglike,f.arg=list(Y=y,n=100,K=4),maximize=T, 

continuous=list(mean=muO,sd=sigmaO,conMat=A,conVec=b,smoothSd=0.5), 

N=10000L,verbose=TRUE) 

With the returned states variable, we can plot the evolution of optimal values per iteration, 
as shown in Figure 8, where the upper line indicates the best value found so far, while the 
lower line gives the worst value of the current elite sample. 

R> par(mai=c(0.6,1,0.5,0.2),oma=c(0,0,0,1)) 

R> plot (res$states [, 'iter'J ,res$states [,'gammat'J , type-s', col="blue",xlab="",ylab=" ") 
R> lines (res$states [, 'optimum'] , type-s', col = "red") 


o 



Figure 8 : Evolution of the best value (upper line) and the worst value of the best (elite) 
samples (lower line) 


R> res 

Optimizer for continuous part: 

1.111656 2.000186 3.534268 3.983616 5.142336 
Optimum: 

486.2124 

Number of iterations: 

35 

Convergence: 

Variance converged 

Maximum likelihood estimates for Dirichlet data can be computed to high accuracy via the 
fixed-point techniques of Minka (2000). This requires sophisticated numerical techniques for 
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inverting digamma functions. When applying this method to the same Dirichlet(l, 2,3,4, 5) 
data, we obtained the estimate S = (1.111715,2.000243, 3.534321, 3.983752, 5.142596), with 
a likelihood value of 486.2124, giving excellent agreement between the two approaches. 


5.6. Lasso regression 

Suppose that we observed some data from the following model: 

Y t = xJ (3 + i = 1 ,..., n , 

where Xj = (xn ,..., Xi p ) T is the p -vector of explanatory variables, (3 = (/3i,. .., /3 P ) T is the p- 
vector of regression coefficients, and the {e*} are the noise terms with E[ej] = 0, Var[e,] = o 2 , 
for all i and Cov(ej,£j) = 0 (Vz ^ j ). Consider a Lasso regression approach to estimate the 
regression vector f3: 


-Masso 

P 


n p 

argmin— ^(y - xj(3) 2 + A |/3j| 


/3eiRp 


2n 

1 


Z=1 


3 = 1 


argmin —||y-y/3|| 2 +A ||/3||i , 
/3GMP _ x 

Penalty 


Loss 


where Y = (Y\,... ,Y n ) T and X = (xi,... ,x n ) T is the (n x p) design matrix. The tuning 
parameter A controls the amount of regularization. 

For a given value of A, we will use CE method to obtain the Lasso regression coefficient and 
compared our results with those obtained by the function glmnet from the package glmnet 
presented by Friedman et al. (2008). 

We generate data of size n = 150, with p = 60 explanatory variables independently generated 
from a standard normal distribution. The true coefficients from (3 are chosen such that 10 
are large (between 0.5 and 1) and 50 are exactly 0. The variance of the noise is equal to 1. 


R> set.seed(10) 

R> n <- 150 
R> p <- 60 

R> beta <- c(runif(10,0.5,1),rep(0,50)) 
R> X <- matrix(rnorm(n*p),ncol=60) 

R> Y <- X°/,*°/ 0 matrix(beta,ncol=l)+rnorm(n) 


We first use the glmnet function to find the Lasso regression coefficient that gives a sparsity 
of 10 ; that is, exactly 10 coefficients are non-zero. 

R> require(glmnet) 

R> res.glmnet <- glmnet(X,Y) 

# Find the lambda value to get a model with a sparsity=10 
R> sparsity.10 <- which(res.glmnet$df==10) 

R> (lambda.10 <- res.glmnet$lambda[sparsity.10[1]]) 


0.2731371 



22 


CEoptim: Cross-Entropy R package for Optimization 


R> beta.glmnet <- res.glmnet$beta[, sparsity.10[1]] 

The corresponding indices are correctly identified by glmnet: 

# Index of the non-zero coefficient 

R> (ind.beta <- which(res.glmnet$beta[,sparsity.10[1]]!=0)) 

VI V2 V3 V4 V5 V6 V7 V8 V9 V10 
123456789 10 

# Values of the non-zero coefficient (NZ) 

R> (beta.glmnet.NZ <- res.glmnet$beta [ind. beta,sparsity.10 [1]]) 

VI V2 V3 V4 V5 V6 

0.39006188 0.39345242 0.40795534 0.57510345 0.18776598 0.19553092 

V7 V8 V9 V10 

0.02929225 0.55435619 0.57656731 0.56279719 

We now use our function to estimate the Lasso regression function for the given A = 0.2731371. 
R> require(CEoptim) 

R> RSS.penalized <- function(x,X,Y, lambda)■[ 

out <- (l/2)*mean((Y-X°/ 0 *y o matrix(x,ncol=l ,nrow=dim(X) [2] ,byrow=TRUE))**2) 
+ lambda*sum(abs(x)) 
return (out)} 

R> muO <- rep(0,times=p) 

R> sigmaO <- rep(5,times=p) 

R> N <- 1000 
R> set.seed (1212) 

R> res <- CEoptim(RSS.penalized,f.arg=list(X=X, Y=Y,lambda=lambda.10), 

continuous=list(mean=mu0,sd=sigma0,sdThr=0.00001),N=N) 

R> beta.CEoptim <- res$optimizer$continuous 
R> # Index of the non-zero coefficient 

R> (ind.beta.CEoptim.NZ <- which(abs(beta.CEoptim)>0.000001)) 

[1] 1 2 3 4 5 6 7 8 9 10 

R> beta.CEoptinm.NZ <- beta.CEoptim[ind.beta.CEoptim.NZ] 

R> (compare.beta.NZ <- rbind(beta.glmnet.NZ,beta.CEoptinm.NZ)) 

VI V2 V3 V4 V5 

beta.glmnet.NE 0.3900619 0.3934524 0.4079553 0.5751035 0.1877660 

beta.CEoptinm.NE 0.3631798 0.3826273 0.4419025 0.6014707 0.1639559 

V7 V8 V9 V10 

beta.glmnet.NE 0.029292247 0.5543562 0.5765673 0.5627972 

beta.CEoptinm.NE 0.005821685 0.5537388 0.5854710 0.6034473 


V6 

0.1955309 

0.1721753 
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The two methods give similar values for the non-zero coefficient, although they are not exactly 
the same. Note, however, that of the two solutions the one found by CEoptim gives the 
smaller value for the objective function + A||/3||i (where Residual Sum of Square: RSS= 
\\Y-X(3\\l). 

R> (RSS.penalized(beta.CEoptim,X=X,Y=Y,lambda=lambda.10)) 

[ 1 ] 1.990268 

R> (RSS.penalized(beta.glmnet,X=X,Y=Y,lambda=lambda.10)) 

[ 1 ] 1.993622 

Further, we compare the results obtained by CEoptim with the ones given by glmnet for the 
sequence of tuning parameter A used by default in the glmnet function. Results given by 
CEoptim are slightly better than glmnet optimizer (see Figure 5.6). In more than 90% of 
the cases (over the 73 values of A investigated) CEoptim gives a lower value for the objective 
function than glmnet. However, the coordinate descent algorithm (Friedman et al. 2008) used 
in glmnet is computationally less demanding than the CE approach. 
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Figure 9: Difference of the objective function values between glmnet and CEoptim for a 
sequence of 73 values of A. 


5.7. AR(1) model with regime switching 

As a final illustration of the use of CEoptim, we consider a model fitting problem involving 
both continuous and discrete variables. 

Let Y t be the added value of a stock at time £, at day t = 1,2,..., 300; that is, the increase 
(which may be negative) in stock price relative to the price at time t = 0. Let Xt be the 
increment at day t. Hence, 

t 

V t = Y J X i , t = 1,...,300. 

i— 1 
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We assume that the {X t } satisfy a zero-mean AR(1) model with three possibly different 
regimes. Specifically, we assume 


where 


Xi — Oi Xi-\ + £j, i — 1,..., 300 , 



flW , 
0 ( 2 ) , 


i = 1,... ,n 
* = n + 1,... ,r 2 
i = r 2 + 1,... ,300 , 


(14) 


1 ^ < 300, \6i\ ^ 1, i = 1,2,3, and the error terms {ffj} are iid and normally 

distributed with standard deviation o. The model thus has two discrete and three continuous 
parameters, as well as a nuisance parameter o. Define 6 = 0^ 2 \0 ^) T , r = (n,r2) T , 

and let x\,... ,£300 be the observed increments. We put xo = 0. We fit the parameters by 
minimizing the least squares function 


300 

L (0X) = Xi-Xif , 

1=1 

where x* is the fitted value d- L Xi- 1, and 6i is determined by G and r via (14). The vector 
of fitted values, say x, can be written in matrix notation as x = X6, where X is a 300 x 3 
matrix where the elements in rows 1 ,..., r\ in the first column are equal to xq, ..., x rj _i ; the 
elements in rows r\ + 1 ,..., in the second column are equal to x ri , ..., x r2 _ 1; the elements 
in rows + 1,..., 300 in the third column are equal to x r2 ,..., X299; and all other elements 
are 0. The implementation of the least squares function is given below. Note that the function 
requires input r — 1 rather than r, because each categorical variable used in CEoptim takes 
value in a set {0 ,..., c} for some c. 

R> sumsqrs <- function(theta,rml,x) { 

N <- length(x) #without x[0] 

r <- 1 + sort(rml) # internal end points of regimes 
if (r[1]==r[2]) { # test for invalid regime 
return(Inf); 

} 

thetas <- rep(theta, times=c(r,N)-c(l,r+l)+l) 
xhat <- c(0,head(x,-l))*thetas 
# Compute sum of squared errors 
sum ( (x-xhat) ~2) 

} 

The data have been generated using the parameters 6 = (0.3,0.9, —0.9), r = (100,200) and 
a = 0.1. The data are included in the package and are available by using: 

R> data(yt) 

R> xt <- yt - c(0,yt[-300]) 

The following code implements the use of CEoptim for this constrained mixed problem. 
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R> A <- rbind(diag(3),-diag(3)) 

R> b <- rep(1,6) 

R> set.seed(123) 

R> require(CEoptim) 

R> res <- CEoptim(f=sumsqrs, f.arg=list(xt), continuous=list (mean=c(0,0,0), 

sd=rep(l.0,3), conMat=A, conVec=b),discrete=list(categories=c(298L,298L), 
smoothProb=0.5),N=10000,rho=0.001, verbose=TRUE) 

The output is as follows: 

R> res 

Optimizer for continuous part: 

0.2702714 0.8801672 -0.8975874 
Optimizer for discrete part: 

99 199 
Optimum: 

2.675727 

Number of iterations: 

13 

Convergence: 

Variance converged 

As the input to CEoptim is r — 1, the optimal vector r is given by 
R> (est.r <- sort(res$optimizer$discrete)+l) 

[1] 100 200 

which gives exactly the “true” boundaries for the regimes. From the estimates of the model, 
one can assess the fit of the model by comparing y t with y t = D and xt against the fit 
xt- Figure 10 shows an excellent fit. 

R> t <- 1:300 

R> est.theta <- res$optimizer$continuous 

R> est.thetas <- rep(est.theta,times=c(est.r,300) - c(1,est.r+1) + 1) 

R> xfit <- c(0,head(xt,-1))*est.thetas 

R> par(mfrow=c(2,1)) 

R> plot (xt~t,type="l",col="blue") 

R> lines(xfit,col="red") 

R> abline(v=c(100,200)) 

R> plot(yt,type="l",col="blue") 

R> lines(cumsum(xfit),col="red") 

R> abline(v=c(100,200)) 

A diagnostic of the residuals is presented in Figures 11, showing a normal quantile plot (left 
panel) and a scatterplot of the residuals (right panel). 
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R> par(mfrow=c(l,2)) 

R> resid <- xfit - xt 

R> plot(resid,ylab="residuals",xlab="t") 
R> qqnorm(resid,ylab="residuals") 




Figure 10: Assessment of the fit of the model: xt (top) and yt (bottom). 
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Figure 11: Diagnostic residuals of the model: scatterplot of the residuals (left) and quantile 
quantile normal plot (right). 
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6. Concluding remarks 

CEoptim provides the R implementation of the cross-entropy method for optimization. The 
versatility and effectiveness of this new package have been illustrated through a variety of 
optimization example, involving continuous, discrete, mixed and constrained optimization 
problems. We have demonstrated how this simple algorithm can be of benefit in statis¬ 
tical inference, including model fitting, regression, maximum likelihood, and lasso meth¬ 
ods. CEoptim is available from the Comprehensive R Archive Network (CRAN) at http: 
//cran.r-proj ect.org/. 
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